In this notebook I perform exploratory data analysis (EDA) and data cleaning on a health dataset extracted from MIMIC-IV.
For the EDA, I have opted to aggregate the time-series data from the last 8 hours of each stay using the mean. The rationale behind this decision is as follows:
Sun et al. (2021) argue that the most recent physiological measurements are often the most relevant for predicting outcomes in ICU settings. Aggregating data from the last 8 hours provides a summary of the patient's current health status, which is crucial for predicting immediate risks such as mortality.
Since the purpose of this EDA is to prepare for dimensionality reduction and feature selection in Assignment 4, the dataset must be converted from a time-series format into a more suitable tabular format. A study in BMC Medical Informatics and Decision Making (2021) states that "using summary statistics such as the mean, median, and standard deviation reduces the complexity and noise of high-dimensional time series data, making it more suitable for predictive modeling".
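The aggregation itself was performed upstream during extraction, but the idea can be sketched as follows. The long-format frame and the `CHARTTIME`/`ICUSTAY_ID` column names here are illustrative assumptions, not the actual extraction code:

```python
import pandas as pd

# Hypothetical long-format vitals table: one row per measurement.
raw = pd.DataFrame({
    "ICUSTAY_ID": [1, 1, 1, 2, 2],
    "CHARTTIME": pd.to_datetime([
        "2180-01-01 01:00", "2180-01-01 09:00", "2180-01-01 10:00",
        "2180-01-01 01:00", "2180-01-01 11:00",
    ]),
    "HEART_RATE": [80.0, 90.0, 100.0, 70.0, 75.0],
})

# Keep only each stay's final 8 hours, then take the mean per stay.
last_time = raw.groupby("ICUSTAY_ID")["CHARTTIME"].transform("max")
window = raw[raw["CHARTTIME"] >= last_time - pd.Timedelta(hours=8)]
agg = window.groupby("ICUSTAY_ID")["HEART_RATE"].mean()
```

For stay 1 only the 09:00 and 10:00 readings fall in the final 8 hours, so its aggregate is their mean; one row per stay is exactly the tabular shape the rest of this notebook works with.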
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import entropy, norm, chi2
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
pd.set_option('display.max_columns', None)
mimic = pd.read_csv("health_dataset.csv")
data = mimic.copy()
display(data.head())
print(f"This dataset contains {data.shape[0]} rows and {data.shape[1]} columns.")
| | SUBJECT_ID | HADM_ID | ICUSTAY_ID | AGE | GENDER | ETHNICITY | HEART_RATE | DIASBP | SYSTBP | MABP | TEMPERATURE | O2SAT | RESP_RATE | ALBUMIN | BUN | BILIRUBIN | LACTATE | BICARBONATE | BAND_NEUTROPHIL | CHLORIDE | CREATININE | GLUCOSE | HEMOGLOBIN | HEMATOCRIT | PLATELET_COUNT | POTASSIUM | PTT | SODIUM | WBC | HOSPITAL_EXPIRE_FLAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18868417 | 21481340 | 36475723 | 77 | M | OTHER | 74.250 | NaN | NaN | NaN | 98.1125 | NaN | 18.375 | NaN | 33.375000 | NaN | 1.300000 | 27.875 | NaN | 105.0000 | 1.537500 | 105.250000 | 15.47500 | 49.75000 | 169.0000 | 4.000 | 75.287500 | 141.875 | 9.5625 | 0 |
| 1 | 17030222 | 22455492 | 30262362 | 85 | F | WHITE | 77.625 | 24.125 | 40.875 | NaN | 97.3875 | NaN | 12.250 | NaN | 25.400000 | NaN | 1.483333 | 26.000 | NaN | 106.2000 | 0.740000 | 120.600000 | 12.27500 | 37.45000 | 311.2500 | 3.940 | 43.066667 | 142.600 | 9.1500 | 1 |
| 2 | 15664318 | 21206710 | 38572720 | 44 | F | OTHER | 103.750 | 120.250 | 130.250 | NaN | 98.5125 | 97.0 | 20.625 | 2.6 | 22.250000 | 0.25 | 1.610000 | 26.500 | NaN | 102.6250 | 3.937500 | 98.250000 | 7.73750 | 26.21250 | 496.7500 | 4.575 | 32.000000 | 139.500 | 17.9750 | 0 |
| 3 | 13970700 | 20870207 | 38790488 | 91 | F | WHITE | 69.625 | 52.625 | 133.500 | NaN | 97.3625 | NaN | 17.750 | NaN | 16.833333 | NaN | NaN | 25.000 | NaN | 101.5625 | 0.641667 | 94.083333 | 11.03125 | 30.93125 | 212.8125 | 3.875 | 36.700000 | 133.500 | 7.7125 | 0 |
| 4 | 17035582 | 20664949 | 37164270 | 63 | M | WHITE | 81.625 | 42.875 | 97.000 | NaN | 98.2750 | NaN | 25.625 | NaN | 29.500000 | 0.50 | 2.450000 | 23.625 | NaN | 105.3750 | 0.900000 | 129.250000 | 11.18750 | 30.02500 | 218.9375 | 4.325 | 31.171429 | 140.125 | 11.3000 | 0 |
This dataset contains 22565 rows and 30 columns.
To aid our analysis, I am going to split the dataset into two separate DataFrames: 'data0' for patients who did not expire (HOSPITAL_EXPIRE_FLAG == 0) and 'data1' for patients who expired (HOSPITAL_EXPIRE_FLAG == 1).
# Split the dataset
data0 = data[data['HOSPITAL_EXPIRE_FLAG'] == 0]
data1 = data[data['HOSPITAL_EXPIRE_FLAG'] == 1]
# Display the first few rows of each DataFrame to verify
print("Data0 (HOSPITAL_EXPIRE_FLAG == 0):")
display(data0.head())
print("Data1 (HOSPITAL_EXPIRE_FLAG == 1):")
display(data1.head())
# Print the shapes to confirm the number of rows in each DataFrame
print(f"Data0 contains {data0.shape[0]} rows and {data0.shape[1]} columns.")
print(f"Data1 contains {data1.shape[0]} rows and {data1.shape[1]} columns.")
Data0 (HOSPITAL_EXPIRE_FLAG == 0):
| | SUBJECT_ID | HADM_ID | ICUSTAY_ID | AGE | GENDER | ETHNICITY | HEART_RATE | DIASBP | SYSTBP | MABP | TEMPERATURE | O2SAT | RESP_RATE | ALBUMIN | BUN | BILIRUBIN | LACTATE | BICARBONATE | BAND_NEUTROPHIL | CHLORIDE | CREATININE | GLUCOSE | HEMOGLOBIN | HEMATOCRIT | PLATELET_COUNT | POTASSIUM | PTT | SODIUM | WBC | HOSPITAL_EXPIRE_FLAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18868417 | 21481340 | 36475723 | 77 | M | OTHER | 74.250 | NaN | NaN | NaN | 98.1125 | NaN | 18.375 | NaN | 33.375000 | NaN | 1.30 | 27.875 | NaN | 105.0000 | 1.537500 | 105.250000 | 15.47500 | 49.75000 | 169.0000 | 4.0000 | 75.287500 | 141.875 | 9.5625 | 0 |
| 2 | 15664318 | 21206710 | 38572720 | 44 | F | OTHER | 103.750 | 120.250 | 130.25 | NaN | 98.5125 | 97.0 | 20.625 | 2.600000 | 22.250000 | 0.250 | 1.61 | 26.500 | NaN | 102.6250 | 3.937500 | 98.250000 | 7.73750 | 26.21250 | 496.7500 | 4.5750 | 32.000000 | 139.500 | 17.9750 | 0 |
| 3 | 13970700 | 20870207 | 38790488 | 91 | F | WHITE | 69.625 | 52.625 | 133.50 | NaN | 97.3625 | NaN | 17.750 | NaN | 16.833333 | NaN | NaN | 25.000 | NaN | 101.5625 | 0.641667 | 94.083333 | 11.03125 | 30.93125 | 212.8125 | 3.8750 | 36.700000 | 133.500 | 7.7125 | 0 |
| 4 | 17035582 | 20664949 | 37164270 | 63 | M | WHITE | 81.625 | 42.875 | 97.00 | NaN | 98.2750 | NaN | 25.625 | NaN | 29.500000 | 0.500 | 2.45 | 23.625 | NaN | 105.3750 | 0.900000 | 129.250000 | 11.18750 | 30.02500 | 218.9375 | 4.3250 | 31.171429 | 140.125 | 11.3000 | 0 |
| 5 | 10457824 | 25280974 | 32230380 | 80 | M | UNKNOWN | 109.500 | NaN | NaN | NaN | 98.2875 | NaN | 28.375 | 2.166667 | 7.000000 | 0.925 | 1.20 | 28.125 | 6.2 | 99.2500 | 0.550000 | 164.375000 | 8.18750 | 24.77500 | 588.0000 | 4.0125 | 63.393750 | 138.750 | 19.0250 | 0 |
Data1 (HOSPITAL_EXPIRE_FLAG == 1):
| | SUBJECT_ID | HADM_ID | ICUSTAY_ID | AGE | GENDER | ETHNICITY | HEART_RATE | DIASBP | SYSTBP | MABP | TEMPERATURE | O2SAT | RESP_RATE | ALBUMIN | BUN | BILIRUBIN | LACTATE | BICARBONATE | BAND_NEUTROPHIL | CHLORIDE | CREATININE | GLUCOSE | HEMOGLOBIN | HEMATOCRIT | PLATELET_COUNT | POTASSIUM | PTT | SODIUM | WBC | HOSPITAL_EXPIRE_FLAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 17030222 | 22455492 | 30262362 | 85 | F | WHITE | 77.625000 | 24.125000 | 40.875000 | NaN | 97.387500 | NaN | 12.2500 | NaN | 25.400000 | NaN | 1.483333 | 26.000000 | NaN | 106.200000 | 0.740000 | 120.600000 | 12.27500 | 37.450000 | 311.250000 | 3.940000 | 43.066667 | 142.600000 | 9.15000 | 1 |
| 11 | 14405232 | 24340879 | 31011196 | 55 | F | UNKNOWN | 93.750000 | NaN | NaN | NaN | 97.522917 | NaN | 23.7500 | 3.666667 | 58.375000 | 35.6000 | 2.700000 | 20.125000 | 0.000 | 107.375000 | 2.150000 | 176.125000 | 7.90625 | 23.691667 | 81.687500 | 3.875000 | 50.325000 | 142.500000 | 20.06875 | 1 |
| 15 | 10641937 | 22804676 | 33298911 | 63 | M | WHITE | 99.979167 | 50.833333 | 72.666667 | NaN | 99.025000 | NaN | 22.8750 | 2.712500 | 91.250000 | 8.4625 | 2.248958 | 16.500000 | 0.875 | 108.875000 | 2.362500 | 124.250000 | 7.76250 | 22.987500 | 23.187500 | 4.650000 | 43.787500 | 139.250000 | 3.99375 | 1 |
| 26 | 12026338 | 28548230 | 32163537 | 68 | F | WHITE | 89.750000 | 62.000000 | 96.250000 | NaN | 99.775000 | 99.0 | 17.1250 | 2.600000 | 20.333333 | 0.4000 | 2.585000 | 16.604167 | NaN | 107.958333 | 1.552083 | 118.104167 | 10.34375 | 31.327083 | 95.770833 | 4.966667 | 43.018750 | 137.541667 | 4.41875 | 1 |
| 29 | 11439927 | 23639754 | 37817300 | 89 | F | WHITE | 80.166667 | 19.000000 | 34.500000 | NaN | 98.250000 | 74.0 | 17.4375 | 3.350000 | 84.125000 | 0.5000 | 2.800000 | 30.875000 | 3.000 | 90.625000 | 3.125000 | 259.500000 | 9.91250 | 32.575000 | 150.625000 | 3.650000 | 54.362500 | 136.500000 | 9.67500 | 1 |
Data0 contains 19286 rows and 30 columns. Data1 contains 3279 rows and 30 columns.
Next, I check the missing values for the original 'data' as well as for 'data0' and 'data1'.
def Missing_Values(data):
    var_name = []
    data_type = []
    total_value = []
    total_missing_value = []
    missing_value_pct = []
    unique_value = []
    total_unique_value = []
    for col in data.columns:
        var_name.append(col)
        data_type.append(data[col].dtype)
        total_value.append(data[col].shape[0])
        total_missing_value.append(data[col].isnull().sum())
        missing_value_pct.append(round(data[col].isnull().sum() / data[col].shape[0], 3))
        unique_value.append(data[col].unique())
        total_unique_value.append(len(data[col].unique()))
    missing_data = pd.DataFrame({"Variable": var_name,
                                 "Data_Type": data_type,
                                 "No_Total_Value": total_value,
                                 "No_Total_Missing_Value": total_missing_value,
                                 "Proportion_Missing_Value": missing_value_pct,
                                 "Unique_Value": unique_value,
                                 "Total_Unique_Value": total_unique_value})
    missing_data = missing_data.set_index("Variable")
    return missing_data
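For reference, the same per-column summary (minus the unique-value listing) can also be built with vectorized pandas calls instead of an explicit loop. This is a sketch on toy data, not a drop-in replacement for the function above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 2.0], "b": ["x", "y", "y", "x"]})

summary = pd.DataFrame({
    "Data_Type": df.dtypes,
    "No_Total_Value": len(df),
    "No_Total_Missing_Value": df.isnull().sum(),
    "Proportion_Missing_Value": df.isnull().mean().round(3),
    # dropna=False counts NaN as a distinct value, matching len(unique())
    "Total_Unique_Value": df.nunique(dropna=False),
})
summary.index.name = "Variable"
```

`isnull().mean()` gives the missing proportion directly, and `nunique(dropna=False)` reproduces the NaN-inclusive unique count from the loop version.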
# Checking missing values for the entire dataset
Missing_Values(data)
| Variable | Data_Type | No_Total_Value | No_Total_Missing_Value | Proportion_Missing_Value | Unique_Value | Total_Unique_Value |
|---|---|---|---|---|---|---|
| SUBJECT_ID | int64 | 22565 | 0 | 0.000 | [18868417, 17030222, 15664318, 13970700, 17035... | 22565 |
| HADM_ID | int64 | 22565 | 0 | 0.000 | [21481340, 22455492, 21206710, 20870207, 20664... | 22565 |
| ICUSTAY_ID | int64 | 22565 | 0 | 0.000 | [36475723, 30262362, 38572720, 38790488, 37164... | 22565 |
| AGE | int64 | 22565 | 0 | 0.000 | [77, 85, 44, 91, 63, 80, 60, 72, 68, 33, 74, 5... | 73 |
| GENDER | object | 22565 | 0 | 0.000 | [M, F] | 2 |
| ETHNICITY | object | 22565 | 0 | 0.000 | [OTHER, WHITE, UNKNOWN, BLACK/AFRICAN AMERICAN... | 8 |
| HEART_RATE | float64 | 22565 | 1 | 0.000 | [74.25, 77.625, 103.75, 69.625, 81.625, 109.5,... | 2298 |
| DIASBP | float64 | 22565 | 10265 | 0.455 | [nan, 24.125, 120.25, 52.625, 42.875, 79.125, ... | 1870 |
| SYSTBP | float64 | 22565 | 10272 | 0.455 | [nan, 40.875, 130.25, 133.5, 97.0, 163.34375, ... | 2297 |
| MABP | float64 | 22565 | 19878 | 0.881 | [nan, 99.375, 71.625, 68.0, 78.1875, 68.166666... | 1137 |
| TEMPERATURE | float64 | 22565 | 94 | 0.004 | [98.1125, 97.3875, 98.5125, 97.3625, 98.275, 9... | 1240 |
| O2SAT | float64 | 22565 | 13816 | 0.612 | [nan, 97.0, 98.5, 96.5, 89.8, 96.375, 93.0, 99... | 383 |
| RESP_RATE | float64 | 22565 | 1 | 0.000 | [18.375, 12.25, 20.625, 17.75, 25.625, 28.375,... | 1345 |
| ALBUMIN | float64 | 22565 | 7511 | 0.333 | [nan, 2.6, 2.166666667, 3.308333333, 4.1, 3.6,... | 692 |
| BUN | float64 | 22565 | 0 | 0.000 | [33.375, 25.4, 22.25, 16.833333333, 29.5, 7.0,... | 2522 |
| BILIRUBIN | float64 | 22565 | 5788 | 0.257 | [nan, 0.25, 0.5, 0.925, 0.7375, 0.52, 0.7, 35.... | 1800 |
| LACTATE | float64 | 22565 | 3918 | 0.174 | [1.3, 1.483333333, 1.61, nan, 2.45, 1.2, 1.928... | 3005 |
| BICARBONATE | float64 | 22565 | 0 | 0.000 | [27.875, 26.0, 26.5, 25.0, 23.625, 28.125, 26.... | 1071 |
| BAND_NEUTROPHIL | float64 | 22565 | 16661 | 0.738 | [nan, 6.2, 0.0, 0.875, 0.5, 3.0, 6.4, 8.125, 6... | 344 |
| CHLORIDE | float64 | 22565 | 0 | 0.000 | [105.0, 106.2, 102.625, 101.5625, 105.375, 99.... | 1263 |
| CREATININE | float64 | 22565 | 4 | 0.000 | [1.5375, 0.74, 3.9375, 0.6416666667, 0.9, 0.55... | 1698 |
| GLUCOSE | float64 | 22565 | 0 | 0.000 | [105.25, 120.6, 98.25, 94.083333333, 129.25, 1... | 3975 |
| HEMOGLOBIN | float64 | 22565 | 0 | 0.000 | [15.475, 12.275, 7.7375, 11.03125, 11.1875, 8.... | 2844 |
| HEMATOCRIT | float64 | 22565 | 0 | 0.000 | [49.75, 37.45, 26.2125, 30.93125, 30.025, 24.7... | 5366 |
| PLATELET_COUNT | float64 | 22565 | 1 | 0.000 | [169.0, 311.25, 496.75, 212.8125, 218.9375, 58... | 8797 |
| POTASSIUM | float64 | 22565 | 0 | 0.000 | [4.0, 3.94, 4.575, 3.875, 4.325, 4.0125, 4.125... | 1036 |
| PTT | float64 | 22565 | 652 | 0.029 | [75.2875, 43.066666667, 32.0, 36.7, 31.1714285... | 8318 |
| SODIUM | float64 | 22565 | 0 | 0.000 | [141.875, 142.6, 139.5, 133.5, 140.125, 138.75... | 1098 |
| WBC | float64 | 22565 | 0 | 0.000 | [9.5625, 9.15, 17.975, 7.7125, 11.3, 19.025, 1... | 5288 |
| HOSPITAL_EXPIRE_FLAG | int64 | 22565 | 0 | 0.000 | [0, 1] | 2 |
# Checking missing values for patients who did not expire
Missing_Values(data0)
| Variable | Data_Type | No_Total_Value | No_Total_Missing_Value | Proportion_Missing_Value | Unique_Value | Total_Unique_Value |
|---|---|---|---|---|---|---|
| SUBJECT_ID | int64 | 19286 | 0 | 0.000 | [18868417, 15664318, 13970700, 17035582, 10457... | 19286 |
| HADM_ID | int64 | 19286 | 0 | 0.000 | [21481340, 21206710, 20870207, 20664949, 25280... | 19286 |
| ICUSTAY_ID | int64 | 19286 | 0 | 0.000 | [36475723, 38572720, 38790488, 37164270, 32230... | 19286 |
| AGE | int64 | 19286 | 0 | 0.000 | [77, 44, 91, 63, 80, 60, 72, 68, 33, 74, 65, 4... | 73 |
| GENDER | object | 19286 | 0 | 0.000 | [M, F] | 2 |
| ETHNICITY | object | 19286 | 0 | 0.000 | [OTHER, WHITE, UNKNOWN, BLACK/AFRICAN AMERICAN... | 8 |
| HEART_RATE | float64 | 19286 | 1 | 0.000 | [74.25, 103.75, 69.625, 81.625, 109.5, 104.125... | 1627 |
| DIASBP | float64 | 19286 | 8997 | 0.467 | [nan, 120.25, 52.625, 42.875, 79.125, 50.25, 7... | 1488 |
| SYSTBP | float64 | 19286 | 9003 | 0.467 | [nan, 130.25, 133.5, 97.0, 163.34375, 123.75, ... | 1762 |
| MABP | float64 | 19286 | 17243 | 0.894 | [nan, 99.375, 71.625, 68.0, 78.1875, 68.166666... | 917 |
| TEMPERATURE | float64 | 19286 | 32 | 0.002 | [98.1125, 98.5125, 97.3625, 98.275, 98.2875, 9... | 1027 |
| O2SAT | float64 | 19286 | 12231 | 0.634 | [nan, 97.0, 98.5, 96.5, 89.8, 96.375, 93.0, 90... | 314 |
| RESP_RATE | float64 | 19286 | 1 | 0.000 | [18.375, 20.625, 17.75, 25.625, 28.375, 18.187... | 887 |
| ALBUMIN | float64 | 19286 | 6898 | 0.358 | [nan, 2.6, 2.166666667, 3.308333333, 4.1, 3.6,... | 611 |
| BUN | float64 | 19286 | 0 | 0.000 | [33.375, 22.25, 16.833333333, 29.5, 7.0, 25.25... | 2043 |
| BILIRUBIN | float64 | 19286 | 5439 | 0.282 | [nan, 0.25, 0.5, 0.925, 0.7375, 0.52, 0.7, 0.4... | 1240 |
| LACTATE | float64 | 19286 | 3748 | 0.194 | [1.3, 1.61, nan, 2.45, 1.2, 1.928125, 1.3375, ... | 2306 |
| BICARBONATE | float64 | 19286 | 0 | 0.000 | [27.875, 26.5, 25.0, 23.625, 28.125, 26.25, 29... | 854 |
| BAND_NEUTROPHIL | float64 | 19286 | 14944 | 0.775 | [nan, 6.2, 0.0, 0.5, 3.0, 5.125, 21.0, 9.4, 1.... | 286 |
| CHLORIDE | float64 | 19286 | 0 | 0.000 | [105.0, 102.625, 101.5625, 105.375, 99.25, 103... | 1046 |
| CREATININE | float64 | 19286 | 1 | 0.000 | [1.5375, 3.9375, 0.6416666667, 0.9, 0.55, 0.96... | 1400 |
| GLUCOSE | float64 | 19286 | 0 | 0.000 | [105.25, 98.25, 94.083333333, 129.25, 164.375,... | 3395 |
| HEMOGLOBIN | float64 | 19286 | 0 | 0.000 | [15.475, 7.7375, 11.03125, 11.1875, 8.1875, 9.... | 2625 |
| HEMATOCRIT | float64 | 19286 | 0 | 0.000 | [49.75, 26.2125, 30.93125, 30.025, 24.775, 28.... | 4886 |
| PLATELET_COUNT | float64 | 19286 | 1 | 0.000 | [169.0, 496.75, 212.8125, 218.9375, 588.0, 120... | 7879 |
| POTASSIUM | float64 | 19286 | 0 | 0.000 | [4.0, 4.575, 3.875, 4.325, 4.0125, 4.125, 3.76... | 864 |
| PTT | float64 | 19286 | 612 | 0.032 | [75.2875, 32.0, 36.7, 31.171428571, 63.39375, ... | 7219 |
| SODIUM | float64 | 19286 | 0 | 0.000 | [141.875, 139.5, 133.5, 140.125, 138.75, 139.8... | 892 |
| WBC | float64 | 19286 | 0 | 0.000 | [9.5625, 17.975, 7.7125, 11.3, 19.025, 13.9625... | 4423 |
| HOSPITAL_EXPIRE_FLAG | int64 | 19286 | 0 | 0.000 | [0] | 1 |
# Checking missing values for patients who expired
Missing_Values(data1)
| Variable | Data_Type | No_Total_Value | No_Total_Missing_Value | Proportion_Missing_Value | Unique_Value | Total_Unique_Value |
|---|---|---|---|---|---|---|
| SUBJECT_ID | int64 | 3279 | 0 | 0.000 | [17030222, 14405232, 10641937, 12026338, 11439... | 3279 |
| HADM_ID | int64 | 3279 | 0 | 0.000 | [22455492, 24340879, 22804676, 28548230, 23639... | 3279 |
| ICUSTAY_ID | int64 | 3279 | 0 | 0.000 | [30262362, 31011196, 33298911, 32163537, 37817... | 3279 |
| AGE | int64 | 3279 | 0 | 0.000 | [85, 55, 63, 68, 89, 67, 44, 60, 66, 81, 58, 5... | 73 |
| GENDER | object | 3279 | 0 | 0.000 | [F, M] | 2 |
| ETHNICITY | object | 3279 | 0 | 0.000 | [WHITE, UNKNOWN, UNABLE TO OBTAIN, ASIAN, BLAC... | 8 |
| HEART_RATE | float64 | 3279 | 0 | 0.000 | [77.625, 93.75, 99.979166667, 89.75, 80.166666... | 1406 |
| DIASBP | float64 | 3279 | 1268 | 0.387 | [24.125, nan, 50.833333333, 62.0, 19.0, 56.653... | 874 |
| SYSTBP | float64 | 3279 | 1269 | 0.387 | [40.875, nan, 72.666666667, 96.25, 34.5, 82.57... | 1164 |
| MABP | float64 | 3279 | 2635 | 0.804 | [nan, 75.0, 79.625, 340.0, 82.8125, 102.0, 79.... | 468 |
| TEMPERATURE | float64 | 3279 | 62 | 0.019 | [97.3875, 97.522916667, 99.025, 99.775, 98.25,... | 706 |
| O2SAT | float64 | 3279 | 1585 | 0.483 | [nan, 99.0, 74.0, 95.0, 97.75, 97.333333333, 9... | 259 |
| RESP_RATE | float64 | 3279 | 0 | 0.000 | [12.25, 23.75, 22.875, 17.125, 17.4375, 24.843... | 857 |
| ALBUMIN | float64 | 3279 | 613 | 0.187 | [nan, 3.666666667, 2.7125, 2.6, 3.35, 2.3, 2.2... | 455 |
| BUN | float64 | 3279 | 0 | 0.000 | [25.4, 58.375, 91.25, 20.333333333, 84.125, 67... | 1535 |
| BILIRUBIN | float64 | 3279 | 349 | 0.106 | [nan, 35.6, 8.4625, 0.4, 0.5, 2.433333333, 17.... | 1081 |
| LACTATE | float64 | 3279 | 170 | 0.052 | [1.483333333, 2.7, 2.248958333, 2.585, 2.8, 2.... | 1461 |
| BICARBONATE | float64 | 3279 | 0 | 0.000 | [26.0, 20.125, 16.5, 16.604166667, 30.875, 22.... | 735 |
| BAND_NEUTROPHIL | float64 | 3279 | 1717 | 0.524 | [nan, 0.0, 0.875, 3.0, 6.4, 8.125, 6.0, 8.0, 1... | 231 |
| CHLORIDE | float64 | 3279 | 0 | 0.000 | [106.2, 107.375, 108.875, 107.958333333, 90.62... | 871 |
| CREATININE | float64 | 3279 | 3 | 0.001 | [0.74, 2.15, 2.3625, 1.552083333, 3.125, 1.616... | 1049 |
| GLUCOSE | float64 | 3279 | 0 | 0.000 | [120.6, 176.125, 124.25, 118.104166667, 259.5,... | 1895 |
| HEMOGLOBIN | float64 | 3279 | 0 | 0.000 | [12.275, 7.90625, 7.7625, 10.34375, 9.9125, 9.... | 1285 |
| HEMATOCRIT | float64 | 3279 | 0 | 0.000 | [37.45, 23.691666667, 22.9875, 31.327083333, 3... | 1998 |
| PLATELET_COUNT | float64 | 3279 | 0 | 0.000 | [311.25, 81.6875, 23.1875, 95.770833333, 150.6... | 2599 |
| POTASSIUM | float64 | 3279 | 0 | 0.000 | [3.94, 3.875, 4.65, 4.966666667, 3.65, 4.40833... | 695 |
| PTT | float64 | 3279 | 40 | 0.012 | [43.066666667, 50.325, 43.7875, 43.01875, 54.3... | 2569 |
| SODIUM | float64 | 3279 | 0 | 0.000 | [142.6, 142.5, 139.25, 137.541666667, 136.5, 1... | 788 |
| WBC | float64 | 3279 | 0 | 0.000 | [9.15, 20.06875, 3.99375, 4.41875, 9.675, 12.8... | 2306 |
| HOSPITAL_EXPIRE_FLAG | int64 | 3279 | 0 | 0.000 | [1] | 1 |
To aid the EDA, the missing-value counts for the whole dataset and per class are stored in the Excel file 'EDA -- Summary of Data Extraction Results.xlsx'.
I decided to drop SUBJECT_ID, HADM_ID, and ICUSTAY_ID from the dataset, as they are unique identifiers and carry no predictive information.
# Drop specified columns from the data DataFrame
data = data.drop(columns=['SUBJECT_ID', 'HADM_ID', 'ICUSTAY_ID'])
# Display the first few rows of the modified data DataFrame
display(data.head())
# Print the shape of the modified data DataFrame
print(f"This dataset contains {data.shape[0]} rows and {data.shape[1]} columns.")
| | AGE | GENDER | ETHNICITY | HEART_RATE | DIASBP | SYSTBP | MABP | TEMPERATURE | O2SAT | RESP_RATE | ALBUMIN | BUN | BILIRUBIN | LACTATE | BICARBONATE | BAND_NEUTROPHIL | CHLORIDE | CREATININE | GLUCOSE | HEMOGLOBIN | HEMATOCRIT | PLATELET_COUNT | POTASSIUM | PTT | SODIUM | WBC | HOSPITAL_EXPIRE_FLAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 77 | M | OTHER | 74.250 | NaN | NaN | NaN | 98.1125 | NaN | 18.375 | NaN | 33.375000 | NaN | 1.300000 | 27.875 | NaN | 105.0000 | 1.537500 | 105.250000 | 15.47500 | 49.75000 | 169.0000 | 4.000 | 75.287500 | 141.875 | 9.5625 | 0 |
| 1 | 85 | F | WHITE | 77.625 | 24.125 | 40.875 | NaN | 97.3875 | NaN | 12.250 | NaN | 25.400000 | NaN | 1.483333 | 26.000 | NaN | 106.2000 | 0.740000 | 120.600000 | 12.27500 | 37.45000 | 311.2500 | 3.940 | 43.066667 | 142.600 | 9.1500 | 1 |
| 2 | 44 | F | OTHER | 103.750 | 120.250 | 130.250 | NaN | 98.5125 | 97.0 | 20.625 | 2.6 | 22.250000 | 0.25 | 1.610000 | 26.500 | NaN | 102.6250 | 3.937500 | 98.250000 | 7.73750 | 26.21250 | 496.7500 | 4.575 | 32.000000 | 139.500 | 17.9750 | 0 |
| 3 | 91 | F | WHITE | 69.625 | 52.625 | 133.500 | NaN | 97.3625 | NaN | 17.750 | NaN | 16.833333 | NaN | NaN | 25.000 | NaN | 101.5625 | 0.641667 | 94.083333 | 11.03125 | 30.93125 | 212.8125 | 3.875 | 36.700000 | 133.500 | 7.7125 | 0 |
| 4 | 63 | M | WHITE | 81.625 | 42.875 | 97.000 | NaN | 98.2750 | NaN | 25.625 | NaN | 29.500000 | 0.50 | 2.450000 | 23.625 | NaN | 105.3750 | 0.900000 | 129.250000 | 11.18750 | 30.02500 | 218.9375 | 4.325 | 31.171429 | 140.125 | 11.3000 | 0 |
This dataset contains 22565 rows and 27 columns.
target = ['HOSPITAL_EXPIRE_FLAG']
target_counts = data[target[0]].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(target_counts, labels=target_counts.index, autopct='%1.1f%%', colors=['#66b3ff','#ff9999'], startangle=140)
plt.title(f'Pie Chart for {target[0]} (Percentage)')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
'HOSPITAL_EXPIRE_FLAG' indicates whether the patient died during the hospital stay (0 = did not expire, 1 = expired). The pie chart shows a heavily imbalanced target, with class '0' comprising roughly 86% of all observations.
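Because the target is so imbalanced, any later train/test split should preserve the roughly 86/14 class ratio. A stratified split on toy data (a sketch only; splitting is not part of this notebook's EDA) looks like:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame mimicking the ~86/14 class imbalance.
toy = pd.DataFrame({
    "x": range(100),
    "HOSPITAL_EXPIRE_FLAG": [0] * 86 + [1] * 14,
})

# stratify keeps the class proportions (nearly) equal in both partitions.
train, test = train_test_split(
    toy, test_size=0.25,
    stratify=toy["HOSPITAL_EXPIRE_FLAG"], random_state=42,
)
```

Without `stratify`, a random split of a rare-positive target can leave the test partition with too few expired patients for a meaningful evaluation.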
cat_cols = ['GENDER', 'ETHNICITY']
# Create bar charts with percentages
for col in cat_cols:
    plt.figure(figsize=(6, 8))
    ax = sns.countplot(data=data, x=col, palette='viridis')
    total = len(data[col])
    for p in ax.patches:
        frequency = p.get_height()
        percentage = f'{100 * frequency / total:.1f}%'
        x = p.get_x() + p.get_width() / 2 - 0.1
        y = p.get_height() + 0.005 * total
        ax.annotate(f'{percentage}', (x, y), size=10, ha='center')
    plt.title(f'Bar Chart for {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=90)
    plt.show()
There are no missing values for the 'GENDER' and 'ETHNICITY' features. There are only two genders in this dataset, with males making up 56.6% of all observations. For ethnicity, 'WHITE' dominates, accounting for 67% of all observations.
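Since these are the only two categorical features, they can later be one-hot encoded before modeling. A sketch with `pd.get_dummies` on toy values (not performed at this stage of the notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    "GENDER": ["M", "F", "F"],
    "ETHNICITY": ["WHITE", "OTHER", "WHITE"],
})

# drop_first avoids perfectly collinear dummy columns.
encoded = pd.get_dummies(toy, columns=["GENDER", "ETHNICITY"], drop_first=True)
```

With `drop_first=True` a binary feature like GENDER collapses to a single indicator column, which matters if the downstream models are sensitive to multicollinearity.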
num_cols_information = [
{"Feature": "AGE", "Description": "The age of the patient.", "MIN": 0, "MAX": 91, "UNIT": "years"},
{"Feature": "HEART_RATE", "Description": "Heart Rate: The average heart rate of the patient.", "MIN": 0, "MAX": 350, "UNIT": "bpm"},
{"Feature": "DIASBP", "Description": "Diastolic Blood Pressure (diasbp): The average diastolic blood pressure of the patient.", "MIN": 0, "MAX": 375, "UNIT": "mmHg"},
{"Feature": "SYSTBP", "Description": "Systolic Blood Pressure (systbp): The average systolic blood pressure of the patient.", "MIN": 0, "MAX": 375, "UNIT": "mmHg"},
{"Feature": "MABP", "Description": "Mean Blood Pressure (meanbp): The average mean blood pressure of the patient.", "MIN": 14, "MAX": 330, "UNIT": "mmHg"},
{"Feature": "TEMPERATURE", "Description": "Temperature: The average body temperature of the patient.", "MIN": 78.8, "MAX": 113, "UNIT": "°F"},
{"Feature": "O2SAT", "Description": "Oxygen Saturation (o2sat): The average oxygen saturation of the patient.", "MIN": 0, "MAX": 100, "UNIT": "%"},
{"Feature": "RESP_RATE", "Description": "Respiratory Rate (resp_rate): The average respiratory rate of the patient.", "MIN": 0, "MAX": 300, "UNIT": "breaths per minute"},
{"Feature": "ALBUMIN", "Description": "Albumin: The average albumin level in the blood.", "MIN": 0.6, "MAX": 6, "UNIT": "g/dL"},
{"Feature": "BUN", "Description": "Blood Urea Nitrogen (bun): The average blood urea nitrogen level.", "MIN": 0, "MAX": 250, "UNIT": "mg/dL"},
{"Feature": "BILIRUBIN", "Description": "Bilirubin: The average bilirubin level.", "MIN": 0.1, "MAX": 60, "UNIT": "mg/dL"},
{"Feature": "LACTATE", "Description": "Lactate: The average lactate level in the blood.", "MIN": 0.4, "MAX": 30, "UNIT": "mmol/L"},
{"Feature": "BICARBONATE", "Description": "Bicarbonate: The average bicarbonate level.", "MIN": 0, "MAX": 60, "UNIT": "mEq/L"},
{"Feature": "CHLORIDE", "Description": "Chloride: The average chloride level.", "MIN": 50, "MAX": 175, "UNIT": "mEq/L"},
{"Feature": "CREATININE", "Description": "Creatinine: The average creatinine level.", "MIN": 0.1, "MAX": 60, "UNIT": "mg/dL"},
{"Feature": "GLUCOSE", "Description": "Glucose: The average glucose level.", "MIN": 33, "MAX": 2000, "UNIT": "mg/dL"},
{"Feature": "HEMOGLOBIN", "Description": "Hemoglobin: The average hemoglobin level.", "MIN": 0, "MAX": 25, "UNIT": "g/dL"},
{"Feature": "HEMATOCRIT", "Description": "Hematocrit: The average hematocrit level.", "MIN": 0, "MAX": 75, "UNIT": "g/dL"},
{"Feature": "PLATELET_COUNT", "Description": "Platelet Count: The average platelet count.", "MIN": 0, "MAX": 2000, "UNIT": "1000/mm³"},
{"Feature": "POTASSIUM", "Description": "Potassium: The average potassium level.", "MIN": 0, "MAX": 12, "UNIT": "mEq/L"},
{"Feature": "PTT", "Description": "Partial Thromboplastin Time (PTT): The average partial thromboplastin time.", "MIN": 18.8, "MAX": 150, "UNIT": "seconds"},
{"Feature": "SODIUM", "Description": "Sodium: The average sodium level.", "MIN": 50, "MAX": 225, "UNIT": "mEq/L"},
{"Feature": "WBC", "Description": "White Blood Cells: The average white blood cell count.", "MIN": 0, "MAX": 1000, "UNIT": "1000/mm³"},
{"Feature": "BAND_NEUTROPHIL", "Description": "Band Neutrophil: The percentage of band neutrophils.", "MIN": 0, "MAX": 100, "UNIT": "%"}
]
pd.DataFrame(num_cols_information)
| | Feature | Description | MIN | MAX | UNIT |
|---|---|---|---|---|---|
| 0 | AGE | The age of the patient. | 0.0 | 91 | years |
| 1 | HEART_RATE | Heart Rate: The average heart rate of the pati... | 0.0 | 350 | bpm |
| 2 | DIASBP | Diastolic Blood Pressure (diasbp): The average... | 0.0 | 375 | mmHg |
| 3 | SYSTBP | Systolic Blood Pressure (systbp): The average ... | 0.0 | 375 | mmHg |
| 4 | MABP | Mean Blood Pressure (meanbp): The average mean... | 14.0 | 330 | mmHg |
| 5 | TEMPERATURE | Temperature: The average body temperature of t... | 78.8 | 113 | °F |
| 6 | O2SAT | Oxygen Saturation (o2sat): The average oxygen ... | 0.0 | 100 | % |
| 7 | RESP_RATE | Respiratory Rate (resp_rate): The average resp... | 0.0 | 300 | breaths per minute |
| 8 | ALBUMIN | Albumin: The average albumin level in the blood. | 0.6 | 6 | g/dL |
| 9 | BUN | Blood Urea Nitrogen (bun): The average blood u... | 0.0 | 250 | mg/dL |
| 10 | BILIRUBIN | Bilirubin: The average bilirubin level. | 0.1 | 60 | mg/dL |
| 11 | LACTATE | Lactate: The average lactate level in the blood. | 0.4 | 30 | mmol/L |
| 12 | BICARBONATE | Bicarbonate: The average bicarbonate level. | 0.0 | 60 | mEq/L |
| 13 | CHLORIDE | Chloride: The average chloride level. | 50.0 | 175 | mEq/L |
| 14 | CREATININE | Creatinine: The average creatinine level. | 0.1 | 60 | mg/dL |
| 15 | GLUCOSE | Glucose: The average glucose level. | 33.0 | 2000 | mg/dL |
| 16 | HEMOGLOBIN | Hemoglobin: The average hemoglobin level. | 0.0 | 25 | g/dL |
| 17 | HEMATOCRIT | Hematocrit: The average hematocrit level. | 0.0 | 75 | g/dL |
| 18 | PLATELET_COUNT | Platelet Count: The average platelet count. | 0.0 | 2000 | 1000/mm³ |
| 19 | POTASSIUM | Potassium: The average potassium level. | 0.0 | 12 | mEq/L |
| 20 | PTT | Partial Thromboplastin Time (PTT): The average... | 18.8 | 150 | seconds |
| 21 | SODIUM | Sodium: The average sodium level. | 50.0 | 225 | mEq/L |
| 22 | WBC | White Blood Cells: The average white blood cel... | 0.0 | 1000 | 1000/mm³ |
| 23 | BAND_NEUTROPHIL | Band Neutrophil: The percentage of band neutro... | 0.0 | 100 | % |
num_cols = ['AGE', 'HEART_RATE', 'DIASBP', 'SYSTBP', 'MABP', 'TEMPERATURE', 'O2SAT', 'RESP_RATE', 'ALBUMIN', 'BUN', 'BILIRUBIN', 'LACTATE', 'BICARBONATE', 'CHLORIDE', 'CREATININE', 'GLUCOSE', 'HEMOGLOBIN', 'HEMATOCRIT', 'PLATELET_COUNT', 'POTASSIUM', 'PTT', 'SODIUM', 'WBC', 'BAND_NEUTROPHIL']
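The hand-typed `num_cols` list duplicates the feature names already present in `num_cols_information`; deriving one from the other keeps the two from drifting apart. A minimal sketch (with the metadata list truncated to two entries for brevity):

```python
# Truncated metadata list; the notebook defines all 24 entries.
num_cols_information = [
    {"Feature": "AGE", "MIN": 0, "MAX": 91, "UNIT": "years"},
    {"Feature": "HEART_RATE", "MIN": 0, "MAX": 350, "UNIT": "bpm"},
]

# Derive the column list instead of retyping it.
num_cols = [info["Feature"] for info in num_cols_information]
```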
Next, I calculate how many observations fall outside each feature's valid range.
# For the entire dataset
results = []
for col_info in num_cols_information:
    feature = col_info["Feature"]
    min_val = col_info["MIN"]
    max_val = col_info["MAX"]
    missing_count = data[feature].isnull().sum()
    below_min_count = (data[feature] < min_val).sum()
    above_max_count = (data[feature] > max_val).sum()
    results.append({
        "Feature": feature,
        "Missing Values": missing_count,
        "Below MIN": below_min_count,
        "Above MAX": above_max_count
    })
# Convert results to DataFrame
results_df = pd.DataFrame(results)
results_df["Total Valid Values"] = data.shape[0] - results_df["Missing Values"] - results_df["Below MIN"] - results_df["Above MAX"]
results_df
| | Feature | Missing Values | Below MIN | Above MAX | Total Valid Values |
|---|---|---|---|---|---|
| 0 | AGE | 0 | 0 | 0 | 22565 |
| 1 | HEART_RATE | 1 | 0 | 0 | 22564 |
| 2 | DIASBP | 10265 | 1 | 7 | 12292 |
| 3 | SYSTBP | 10272 | 0 | 0 | 12293 |
| 4 | MABP | 19878 | 36 | 4 | 2647 |
| 5 | TEMPERATURE | 94 | 10 | 8 | 22453 |
| 6 | O2SAT | 13816 | 0 | 11 | 8738 |
| 7 | RESP_RATE | 1 | 0 | 1 | 22563 |
| 8 | ALBUMIN | 7511 | 0 | 0 | 15054 |
| 9 | BUN | 0 | 0 | 0 | 22565 |
| 10 | BILIRUBIN | 5788 | 3 | 2 | 16772 |
| 11 | LACTATE | 3918 | 3 | 1 | 18643 |
| 12 | BICARBONATE | 0 | 0 | 0 | 22565 |
| 13 | CHLORIDE | 0 | 0 | 0 | 22565 |
| 14 | CREATININE | 4 | 2 | 0 | 22559 |
| 15 | GLUCOSE | 0 | 0 | 0 | 22565 |
| 16 | HEMOGLOBIN | 0 | 0 | 0 | 22565 |
| 17 | HEMATOCRIT | 0 | 0 | 0 | 22565 |
| 18 | PLATELET_COUNT | 1 | 0 | 0 | 22564 |
| 19 | POTASSIUM | 0 | 0 | 0 | 22565 |
| 20 | PTT | 652 | 3 | 0 | 21910 |
| 21 | SODIUM | 0 | 0 | 0 | 22565 |
| 22 | WBC | 0 | 0 | 0 | 22565 |
| 23 | BAND_NEUTROPHIL | 16661 | 0 | 0 | 5904 |
# For patients who did not expire
results = []
for col_info in num_cols_information:
    feature = col_info["Feature"]
    min_val = col_info["MIN"]
    max_val = col_info["MAX"]
    missing_count = data0[feature].isnull().sum()
    below_min_count = (data0[feature] < min_val).sum()
    above_max_count = (data0[feature] > max_val).sum()
    results.append({
        "Feature": feature,
        "Missing Values": missing_count,
        "Below MIN": below_min_count,
        "Above MAX": above_max_count
    })
# Convert results to DataFrame
results_df = pd.DataFrame(results)
results_df["Total Valid Values"] = data0.shape[0] - results_df["Missing Values"] - results_df["Below MIN"] - results_df["Above MAX"]
results_df
| | Feature | Missing Values | Below MIN | Above MAX | Total Valid Values |
|---|---|---|---|---|---|
| 0 | AGE | 0 | 0 | 0 | 19286 |
| 1 | HEART_RATE | 1 | 0 | 0 | 19285 |
| 2 | DIASBP | 8997 | 1 | 4 | 10284 |
| 3 | SYSTBP | 9003 | 0 | 0 | 10283 |
| 4 | MABP | 17243 | 28 | 3 | 2012 |
| 5 | TEMPERATURE | 32 | 3 | 8 | 19243 |
| 6 | O2SAT | 12231 | 0 | 8 | 7047 |
| 7 | RESP_RATE | 1 | 0 | 1 | 19284 |
| 8 | ALBUMIN | 6898 | 0 | 0 | 12388 |
| 9 | BUN | 0 | 0 | 0 | 19286 |
| 10 | BILIRUBIN | 5439 | 3 | 0 | 13844 |
| 11 | LACTATE | 3748 | 2 | 1 | 15535 |
| 12 | BICARBONATE | 0 | 0 | 0 | 19286 |
| 13 | CHLORIDE | 0 | 0 | 0 | 19286 |
| 14 | CREATININE | 1 | 2 | 0 | 19283 |
| 15 | GLUCOSE | 0 | 0 | 0 | 19286 |
| 16 | HEMOGLOBIN | 0 | 0 | 0 | 19286 |
| 17 | HEMATOCRIT | 0 | 0 | 0 | 19286 |
| 18 | PLATELET_COUNT | 1 | 0 | 0 | 19285 |
| 19 | POTASSIUM | 0 | 0 | 0 | 19286 |
| 20 | PTT | 612 | 2 | 0 | 18672 |
| 21 | SODIUM | 0 | 0 | 0 | 19286 |
| 22 | WBC | 0 | 0 | 0 | 19286 |
| 23 | BAND_NEUTROPHIL | 14944 | 0 | 0 | 4342 |
# For patients who expired
results = []
for col_info in num_cols_information:
    feature = col_info["Feature"]
    min_val = col_info["MIN"]
    max_val = col_info["MAX"]
    missing_count = data1[feature].isnull().sum()
    below_min_count = (data1[feature] < min_val).sum()
    above_max_count = (data1[feature] > max_val).sum()
    results.append({
        "Feature": feature,
        "Missing Values": missing_count,
        "Below MIN": below_min_count,
        "Above MAX": above_max_count
    })
# Convert results to DataFrame
results_df = pd.DataFrame(results)
results_df["Total Valid Values"] = data1.shape[0] - results_df["Missing Values"] - results_df["Below MIN"] - results_df["Above MAX"]
results_df
| Feature | Missing Values | Below MIN | Above MAX | Total Valid Values | |
|---|---|---|---|---|---|
| 0 | AGE | 0 | 0 | 0 | 3279 |
| 1 | HEART_RATE | 0 | 0 | 0 | 3279 |
| 2 | DIASBP | 1268 | 0 | 3 | 2008 |
| 3 | SYSTBP | 1269 | 0 | 0 | 2010 |
| 4 | MABP | 2635 | 8 | 1 | 635 |
| 5 | TEMPERATURE | 62 | 7 | 0 | 3210 |
| 6 | O2SAT | 1585 | 0 | 3 | 1691 |
| 7 | RESP_RATE | 0 | 0 | 0 | 3279 |
| 8 | ALBUMIN | 613 | 0 | 0 | 2666 |
| 9 | BUN | 0 | 0 | 0 | 3279 |
| 10 | BILIRUBIN | 349 | 0 | 2 | 2928 |
| 11 | LACTATE | 170 | 1 | 0 | 3108 |
| 12 | BICARBONATE | 0 | 0 | 0 | 3279 |
| 13 | CHLORIDE | 0 | 0 | 0 | 3279 |
| 14 | CREATININE | 3 | 0 | 0 | 3276 |
| 15 | GLUCOSE | 0 | 0 | 0 | 3279 |
| 16 | HEMOGLOBIN | 0 | 0 | 0 | 3279 |
| 17 | HEMATOCRIT | 0 | 0 | 0 | 3279 |
| 18 | PLATELET_COUNT | 0 | 0 | 0 | 3279 |
| 19 | POTASSIUM | 0 | 0 | 0 | 3279 |
| 20 | PTT | 40 | 1 | 0 | 3238 |
| 21 | SODIUM | 0 | 0 | 0 | 3279 |
| 22 | WBC | 0 | 0 | 0 | 3279 |
| 23 | BAND_NEUTROPHIL | 1717 | 0 | 0 | 1562 |
The out-of-range values were most likely inserted deliberately as markers during the de-identification of the dataset. I am going to convert all of these invalid values to NaN (refer to the 'EDA -- Summary of Data Extraction Results' Excel file for the statistics).
# Process each column: mask values outside the plausible physiological range as NaN
for col_info in num_cols_information:
    feature = col_info["Feature"]
    min_val = col_info["MIN"]
    max_val = col_info["MAX"]
    # Replace values below min_val and values above max_val with NaN
    data[feature] = np.where(data[feature] < min_val, np.nan, data[feature])
    data[feature] = np.where(data[feature] > max_val, np.nan, data[feature])
Now I am going to check that all out-of-range values have been successfully converted to NaN.
results = []
for col_info in num_cols_information:
feature = col_info["Feature"]
min_val = col_info["MIN"]
max_val = col_info["MAX"]
missing_count = data[feature].isnull().sum()
below_min_count = (data[feature] < min_val).sum()
above_max_count = (data[feature] > max_val).sum()
results.append({
"Feature": feature,
"Missing Values": missing_count,
"Below MIN": below_min_count,
"Above MAX": above_max_count
})
# Convert results to DataFrame
results_df = pd.DataFrame(results)
results_df["Total Valid Values"] = data.shape[0] - results_df["Missing Values"] - results_df["Below MIN"] - results_df["Above MAX"]
results_df
| Feature | Missing Values | Below MIN | Above MAX | Total Valid Values | |
|---|---|---|---|---|---|
| 0 | AGE | 0 | 0 | 0 | 22565 |
| 1 | HEART_RATE | 1 | 0 | 0 | 22564 |
| 2 | DIASBP | 10273 | 0 | 0 | 12292 |
| 3 | SYSTBP | 10272 | 0 | 0 | 12293 |
| 4 | MABP | 19918 | 0 | 0 | 2647 |
| 5 | TEMPERATURE | 112 | 0 | 0 | 22453 |
| 6 | O2SAT | 13827 | 0 | 0 | 8738 |
| 7 | RESP_RATE | 2 | 0 | 0 | 22563 |
| 8 | ALBUMIN | 7511 | 0 | 0 | 15054 |
| 9 | BUN | 0 | 0 | 0 | 22565 |
| 10 | BILIRUBIN | 5793 | 0 | 0 | 16772 |
| 11 | LACTATE | 3922 | 0 | 0 | 18643 |
| 12 | BICARBONATE | 0 | 0 | 0 | 22565 |
| 13 | CHLORIDE | 0 | 0 | 0 | 22565 |
| 14 | CREATININE | 6 | 0 | 0 | 22559 |
| 15 | GLUCOSE | 0 | 0 | 0 | 22565 |
| 16 | HEMOGLOBIN | 0 | 0 | 0 | 22565 |
| 17 | HEMATOCRIT | 0 | 0 | 0 | 22565 |
| 18 | PLATELET_COUNT | 1 | 0 | 0 | 22564 |
| 19 | POTASSIUM | 0 | 0 | 0 | 22565 |
| 20 | PTT | 655 | 0 | 0 | 21910 |
| 21 | SODIUM | 0 | 0 | 0 | 22565 |
| 22 | WBC | 0 | 0 | 0 | 22565 |
| 23 | BAND_NEUTROPHIL | 16661 | 0 | 0 | 5904 |
To check the distribution of each numerical feature, I am going to compute the KL divergence of each feature against a fitted normal distribution and plot histograms for all numerical features.
# Store the results
kl_divergence_results = []
# Function to calculate KL divergence
def kl_divergence(p, q):
return entropy(p, q)
# Iterate over each numerical column
for col in num_cols:
col_data = data[col]
# Calculate the probability distributions
hist, bin_edges = np.histogram(col_data.dropna(), bins=30, density=True)
# Assume normal distribution as reference distribution
reference_distribution = norm.pdf(bin_edges[:-1], np.mean(col_data.dropna()), np.std(col_data.dropna()))
# Normalize the histogram to make it a valid probability distribution
hist /= np.sum(hist)
reference_distribution /= np.sum(reference_distribution)
# Calculate KL divergence
kl_div = kl_divergence(hist, reference_distribution)
kl_divergence_results.append({'Feature': col, 'KL Divergence': kl_div})
# Convert results to DataFrame
kl_divergence_df = pd.DataFrame(kl_divergence_results)
# Display KL divergence results
print(kl_divergence_df)
| Feature | KL Divergence | |
|---|---|---|
| 0 | AGE | 0.065410 |
| 1 | HEART_RATE | 0.040932 |
| 2 | DIASBP | 0.050021 |
| 3 | SYSTBP | 0.031161 |
| 4 | MABP | 0.233120 |
| 5 | TEMPERATURE | 0.293524 |
| 6 | O2SAT | 0.332655 |
| 7 | RESP_RATE | 0.256400 |
| 8 | ALBUMIN | 0.025676 |
| 9 | BUN | 0.300652 |
| 10 | BILIRUBIN | 1.247126 |
| 11 | LACTATE | 0.396055 |
| 12 | BICARBONATE | 0.053181 |
| 13 | CHLORIDE | 0.035973 |
| 14 | CREATININE | 0.531383 |
| 15 | GLUCOSE | 0.225293 |
| 16 | HEMOGLOBIN | 0.073793 |
| 17 | HEMATOCRIT | 0.080001 |
| 18 | PLATELET_COUNT | 0.156950 |
| 19 | POTASSIUM | 0.060117 |
| 20 | PTT | 0.355429 |
| 21 | SODIUM | 0.048403 |
| 22 | WBC | 0.471390 |
| 23 | BAND_NEUTROPHIL | 0.586817 |
# Define the number of columns for the plot
num_columns = 3
# Calculate the number of rows needed
num_rows = int(np.ceil(len(num_cols) / num_columns))
# Create a figure and a set of subplots
fig, axs = plt.subplots(num_rows, num_columns, figsize=(15, num_rows * 5))
# Flatten the axes array for easier indexing
axs = axs.flatten()
# Iterate over each numerical column to plot histograms
for idx, col in enumerate(num_cols):
col_data = data[col]
# Plot histogram
sns.histplot(col_data, kde=True, bins=30, ax=axs[idx])
axs[idx].set_title(f'Histogram of {col}')
axs[idx].set_xlabel(col)
axs[idx].set_ylabel('Frequency')
# Remove any empty subplots
for idx in range(len(num_cols), len(axs)):
fig.delaxes(axs[idx])
plt.tight_layout()
plt.show()
According to the MIMIC-IV documentation, the missingness in this dataset is most likely a mix of MAR (Missing at Random) and MNAR (Missing Not at Random). MAR can arise when certain patient demographics or clinical conditions lead to fewer recorded measurements because of specific clinical practices or priorities; data can also be missing because of the severity of a patient's condition (MNAR).
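The MAR/MNAR intuition can be probed empirically by testing whether a feature's missingness depends on the outcome class. Below is a minimal sketch on synthetic data; the toy column `LAB` and its missingness rates are assumptions for illustration only, and on the real data one would cross-tabulate `data[col].isnull()` against `HOSPITAL_EXPIRE_FLAG`:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in: a lab value that is missing more often in class 0
flag = rng.integers(0, 2, size=n)
value = rng.normal(100.0, 15.0, size=n)
miss_prob = np.where(flag == 0, 0.45, 0.15)   # class-dependent missingness
value[rng.random(n) < miss_prob] = np.nan
df = pd.DataFrame({"LAB": value, "HOSPITAL_EXPIRE_FLAG": flag})

# Cross-tabulate the missingness indicator against the class label
table = pd.crosstab(df["LAB"].isnull(), df["HOSPITAL_EXPIRE_FLAG"])
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2_stat:.1f}, p = {p_value:.2e}")
```

A small p-value means the missingness rate differs between the two classes, which rules out MCAR (though it cannot by itself distinguish MAR from MNAR).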
missing_values_heart_rate0 = data0['HEART_RATE'].isnull().sum()
print(f"Number of missing values in HEART_RATE for Class 0: {missing_values_heart_rate0}")
missing_values_heart_rate1 = data1['HEART_RATE'].isnull().sum()
print(f"Number of missing values in HEART_RATE for Class 1: {missing_values_heart_rate1}")
Number of missing values in HEART_RATE for Class 0: 1
Number of missing values in HEART_RATE for Class 1: 0
The distribution of 'HEART_RATE' is approximately normal, so I am going to replace the missing value with the mean heart rate of Class 0 patients.
# Calculate the mean of HEART_RATE from data0
mean_heart_rate = data0['HEART_RATE'].mean()
print(f"Mean HEART_RATE from data0: {mean_heart_rate}")
# Replace missing values in HEART_RATE of data with the mean from data0
data['HEART_RATE'] = data['HEART_RATE'].fillna(mean_heart_rate)
# Verify the replacement
missing_values_heart_rate_after = data['HEART_RATE'].isnull().sum()
print(f"Number of missing values in HEART_RATE after replacement: {missing_values_heart_rate_after}")
Mean HEART_RATE from data0: 83.74654869370609
Number of missing values in HEART_RATE after replacement: 0
The distribution of 'DIASBP' is right-skewed. Within class 0, 46.65% of DIASBP values are missing; within class 1, 38.67% are missing.
The distribution of 'SYSTBP' is approximately normal. Within class 0, 46.68% of SYSTBP values are missing; within class 1, 38.70% are missing.
Originally I planned to replace the missing values in both features with the class-wise medians. However, when I did that in an earlier version of this EDA (not shown in this notebook), it significantly altered the shape of both distributions (as indicated by the KL divergence scores before vs. after imputation).
I also observed that median imputation inflated the apparent strength of the relationship between the two features.
Notice as well that the missingness rates differ markedly between class 0 and class 1 for both features, a good indication that the data are most likely not missing at random. The missingness percentages for the two features are nearly identical within each class, which suggests that a patient without a systolic blood pressure measurement most likely also lacks a diastolic measurement.
Considering the arguments above, I decided to simply drop the two features.
missing_values_DIASBP0 = data0['DIASBP'].isnull().sum()
print(f"Number of missing values in DIASBP for Class 0: {missing_values_DIASBP0}")
missing_values_DIASBP1 = data1['DIASBP'].isnull().sum()
print(f"Number of missing values in DIASBP for Class 1: {missing_values_DIASBP1}")
Number of missing values in DIASBP for Class 0: 8997
Number of missing values in DIASBP for Class 1: 1268
missing_values_SYSTBP0 = data0['SYSTBP'].isnull().sum()
print(f"Number of missing values in SYSTBP for Class 0: {missing_values_SYSTBP0}")
missing_values_SYSTBP1 = data1['SYSTBP'].isnull().sum()
print(f"Number of missing values in SYSTBP for Class 1: {missing_values_SYSTBP1}")
Number of missing values in SYSTBP for Class 0: 9003
Number of missing values in SYSTBP for Class 1: 1269
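The claim that systolic and diastolic readings tend to go missing together can be checked directly with a missingness-agreement rate. A sketch on a toy frame (the missingness pattern is synthetic; on the real data the same expression would be applied to `data['SYSTBP']` and `data['DIASBP']`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Toy frame: blood-pressure readings usually go missing in pairs,
# with a small fraction of systolic-only gaps
both_missing = rng.random(n) < 0.45
only_sys_missing = rng.random(n) < 0.02
systbp = rng.normal(120.0, 15.0, n)
diasbp = rng.normal(70.0, 10.0, n)
systbp[both_missing | only_sys_missing] = np.nan
diasbp[both_missing] = np.nan

df = pd.DataFrame({"SYSTBP": systbp, "DIASBP": diasbp})

# Fraction of rows where the two missingness indicators agree
agreement = (df["SYSTBP"].isnull() == df["DIASBP"].isnull()).mean()
print(f"Missingness agreement: {agreement:.1%}")
```

An agreement close to 100% supports handling the two columns jointly (dropping both, as done below) rather than imputing them independently.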
data = data.drop(columns=['DIASBP', 'SYSTBP'])
MABP is missing in 88.3% of rows and O2SAT in 61.3%. The percentage of missing values is too high, therefore we will drop these columns.
data = data.drop(columns=['MABP', 'O2SAT'])
There are 73.84% missing values in total (77.49% within class 0 and 52.36% within class 1). The percentage of missing values is too high, therefore we will drop this column.
data = data.drop(columns=['BAND_NEUTROPHIL'])
The distribution of TEMPERATURE is right-skewed, so we will replace the missing values with the per-class median. Missing values make up 0.42% of all observations.
# Calculate the median of TEMPERATURE from data0
median_temperature = data0['TEMPERATURE'].median()
print(f"Median TEMPERATURE from data0: {median_temperature}")
# Replace missing TEMPERATURE values in data where HOSPITAL_EXPIRE_FLAG is 0
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 0) & (data['TEMPERATURE'].isnull())
data.loc[condition, 'TEMPERATURE'] = median_temperature
# Calculate the median of TEMPERATURE from data1
median_temperature = data1['TEMPERATURE'].median()
print(f"Median TEMPERATURE from data1: {median_temperature}")
# Replace missing TEMPERATURE values in data where HOSPITAL_EXPIRE_FLAG is 1
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 1) & (data['TEMPERATURE'].isnull())
data.loc[condition, 'TEMPERATURE'] = median_temperature
Median TEMPERATURE from data0: 98.3
Median TEMPERATURE from data1: 98.2625
The distribution of RESP_RATE is right-skewed, so we will replace the missing value with the median. There is only one missing observation, in class 0.
median_RESP_RATE = data0['RESP_RATE'].median()
print(f"Median RESP_RATE from data0: {median_RESP_RATE}")
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 0) & (data['RESP_RATE'].isnull())
data.loc[condition, 'RESP_RATE'] = median_RESP_RATE
Median RESP_RATE from data0: 19.625
The distribution of ALBUMIN is approximately normal. We will replace the missing values with the per-class median. In total, 33.29% of values are missing (35.77% within class 0 and 18.69% within class 1).
# Calculate the median of ALBUMIN from data0
median_albumin = data0['ALBUMIN'].median()
print(f"Median ALBUMIN from data0: {median_albumin}")
# Replace missing ALBUMIN values in data where HOSPITAL_EXPIRE_FLAG is 0
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 0) & (data['ALBUMIN'].isnull())
data.loc[condition, 'ALBUMIN'] = median_albumin
# Calculate the median of ALBUMIN from data1
median_albumin = data1['ALBUMIN'].median()
print(f"Median ALBUMIN from data1: {median_albumin}")
# Replace missing ALBUMIN values in data where HOSPITAL_EXPIRE_FLAG is 1
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 1) & (data['ALBUMIN'].isnull())
data.loc[condition, 'ALBUMIN'] = median_albumin
Median ALBUMIN from data0: 3.15
Median ALBUMIN from data1: 2.7875
The distribution of BILIRUBIN is right-skewed, so we will replace the missing values with the per-class median. In total, 25.65% of values are missing (28.20% within class 0 and 10.64% within class 1).
# Calculate the median of BILIRUBIN from data0
median_bilirubin = data0['BILIRUBIN'].median()
print(f"Median BILIRUBIN from data0: {median_bilirubin}")
# Replace missing BILIRUBIN values in data where HOSPITAL_EXPIRE_FLAG is 0
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 0) & (data['BILIRUBIN'].isnull())
data.loc[condition, 'BILIRUBIN'] = median_bilirubin
# Calculate the median of BILIRUBIN from data1
median_bilirubin = data1['BILIRUBIN'].median()
print(f"Median BILIRUBIN from data1: {median_bilirubin}")
# Replace missing BILIRUBIN values in data where HOSPITAL_EXPIRE_FLAG is 1
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 1) & (data['BILIRUBIN'].isnull())
data.loc[condition, 'BILIRUBIN'] = median_bilirubin
Median BILIRUBIN from data0: 0.6
Median BILIRUBIN from data1: 0.9
The distribution of LACTATE is right-skewed, so we will replace the missing values with the per-class median. In total, 17.36% of values are missing (19.43% within class 0 and 5.18% within class 1).
# Calculate the median of LACTATE from data0
median_lactate = data0['LACTATE'].median()
print(f"Median LACTATE from data0: {median_lactate}")
# Replace missing LACTATE values in data where HOSPITAL_EXPIRE_FLAG is 0
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 0) & (data['LACTATE'].isnull())
data.loc[condition, 'LACTATE'] = median_lactate
# Calculate the median of LACTATE from data1
median_lactate = data1['LACTATE'].median()
print(f"Median LACTATE from data1: {median_lactate}")
# Replace missing LACTATE values in data where HOSPITAL_EXPIRE_FLAG is 1
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 1) & (data['LACTATE'].isnull())
data.loc[condition, 'LACTATE'] = median_lactate
Median LACTATE from data0: 1.55
Median LACTATE from data1: 2.0
The distribution of CREATININE is right-skewed, so we will replace the missing values with the per-class median. Only 0.02% of values are missing.
# Calculate the median of CREATININE from data0
median_creatinine = data0['CREATININE'].median()
print(f"Median CREATININE from data0: {median_creatinine}")
# Replace missing CREATININE values in data where HOSPITAL_EXPIRE_FLAG is 0
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 0) & (data['CREATININE'].isnull())
data.loc[condition, 'CREATININE'] = median_creatinine
# Calculate the median of CREATININE from data1
median_creatinine = data1['CREATININE'].median()
print(f"Median CREATININE from data1: {median_creatinine}")
# Replace missing CREATININE values in data where HOSPITAL_EXPIRE_FLAG is 1
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 1) & (data['CREATININE'].isnull())
data.loc[condition, 'CREATININE'] = median_creatinine
Median CREATININE from data0: 0.875
Median CREATININE from data1: 1.375
The distribution of PLATELET_COUNT is right-skewed, so we will replace the missing value with the median; there is only one missing value, in class 0.
median_PLATELET_COUNT = data0['PLATELET_COUNT'].median()
print(f"Median PLATELET_COUNT from data0: {median_PLATELET_COUNT}")
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 0) & (data['PLATELET_COUNT'].isnull())
data.loc[condition, 'PLATELET_COUNT'] = median_PLATELET_COUNT
Median PLATELET_COUNT from data0: 219.0
The distribution of PTT is right-skewed, so we will replace the missing values with the per-class median. In total, 2.89% of values are missing.
# Calculate the median of PTT from data0
median_ptt = data0['PTT'].median()
print(f"Median PTT from data0: {median_ptt}")
# Replace missing PTT values in data where HOSPITAL_EXPIRE_FLAG is 0
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 0) & (data['PTT'].isnull())
data.loc[condition, 'PTT'] = median_ptt
# Calculate the median of PTT from data1
median_ptt = data1['PTT'].median()
print(f"Median PTT from data1: {median_ptt}")
# Replace missing PTT values in data where HOSPITAL_EXPIRE_FLAG is 1
condition = (data['HOSPITAL_EXPIRE_FLAG'] == 1) & (data['PTT'].isnull())
data.loc[condition, 'PTT'] = median_ptt
Median PTT from data0: 31.6
Median PTT from data1: 38.75
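The per-class median imputation repeated above for TEMPERATURE, ALBUMIN, BILIRUBIN, LACTATE, CREATININE, PLATELET_COUNT, and PTT can be factored into one helper. A sketch on toy data (the helper name `impute_by_class_median` is mine; it assumes the class medians are computed from the same frame being filled, before any filling happens):

```python
import numpy as np
import pandas as pd

def impute_by_class_median(data, col, flag_col="HOSPITAL_EXPIRE_FLAG"):
    """Fill NaNs in `col` with the median of `col` within each class of `flag_col`."""
    # Medians are computed per class before any filling happens
    for cls, median in data.groupby(flag_col)[col].median().items():
        mask = (data[flag_col] == cls) & (data[col].isnull())
        data.loc[mask, col] = median
    return data

# Toy demonstration: one missing PTT value per class
df = pd.DataFrame({
    "HOSPITAL_EXPIRE_FLAG": [0, 0, 0, 1, 1, 1],
    "PTT": [30.0, 34.0, np.nan, 38.0, 40.0, np.nan],
})
df = impute_by_class_median(df, "PTT")
print(df["PTT"].tolist())  # [30.0, 34.0, 32.0, 38.0, 40.0, 39.0]
```

This keeps the feature-by-feature cells above to a single call each and removes the risk of copy-paste slips between the class-0 and class-1 blocks.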
### Making sure there are no more missing values
data.isnull().sum()
AGE                     0
GENDER                  0
ETHNICITY               0
HEART_RATE              0
TEMPERATURE             0
RESP_RATE               0
ALBUMIN                 0
BUN                     0
BILIRUBIN               0
LACTATE                 0
BICARBONATE             0
CHLORIDE                0
CREATININE              0
GLUCOSE                 0
HEMOGLOBIN              0
HEMATOCRIT              0
PLATELET_COUNT          0
POTASSIUM               0
PTT                     0
SODIUM                  0
WBC                     0
HOSPITAL_EXPIRE_FLAG    0
dtype: int64
After dealing with missing values I will once again recalculate the KL divergence to make sure the distribution is not significantly altered.
cols = list(data.columns)
num_cols = ['AGE','HEART_RATE','TEMPERATURE', 'RESP_RATE', 'ALBUMIN', 'BUN', 'BILIRUBIN', 'LACTATE', 'BICARBONATE', 'CHLORIDE', 'CREATININE', 'GLUCOSE', 'HEMOGLOBIN', 'HEMATOCRIT', 'PLATELET_COUNT', 'POTASSIUM', 'PTT', 'SODIUM', 'WBC', 'HOSPITAL_EXPIRE_FLAG']
# Store the results
kl_divergence_results = []
# Function to calculate KL divergence
def kl_divergence(p, q):
return entropy(p, q)
# Iterate over each numerical column
for col in num_cols:
col_data = data[col]
# Calculate the probability distributions
hist, bin_edges = np.histogram(col_data.dropna(), bins=30, density=True)
# Assume normal distribution as reference distribution
reference_distribution = norm.pdf(bin_edges[:-1], np.mean(col_data.dropna()), np.std(col_data.dropna()))
# Normalize the histogram to make it a valid probability distribution
hist /= np.sum(hist)
reference_distribution /= np.sum(reference_distribution)
# Calculate KL divergence
kl_div = kl_divergence(hist, reference_distribution)
kl_divergence_results.append({'Feature': col, 'KL Divergence': kl_div})
# Convert results to DataFrame
kl_divergence_df = pd.DataFrame(kl_divergence_results)
# Display KL divergence results
print(kl_divergence_df)
| Feature | KL Divergence | |
|---|---|---|
| 0 | AGE | 0.065410 |
| 1 | HEART_RATE | 0.040935 |
| 2 | TEMPERATURE | 0.294249 |
| 3 | RESP_RATE | 0.256393 |
| 4 | ALBUMIN | 0.263427 |
| 5 | BUN | 0.300652 |
| 6 | BILIRUBIN | 1.260128 |
| 7 | LACTATE | 0.474169 |
| 8 | BICARBONATE | 0.053181 |
| 9 | CHLORIDE | 0.035973 |
| 10 | CREATININE | 0.531361 |
| 11 | GLUCOSE | 0.225293 |
| 12 | HEMOGLOBIN | 0.073793 |
| 13 | HEMATOCRIT | 0.080001 |
| 14 | PLATELET_COUNT | 0.156951 |
| 15 | POTASSIUM | 0.060117 |
| 16 | PTT | 0.373272 |
| 17 | SODIUM | 0.048403 |
| 18 | WBC | 0.471390 |
| 19 | HOSPITAL_EXPIRE_FLAG | 2.927277 |
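To make the before/after comparison explicit, the two KL tables can be merged and a per-feature shift computed. A sketch using two of the values reported above (ALBUMIN and LACTATE); in the notebook the pre-imputation table would need to be kept under a separate variable name instead of being overwritten:

```python
import pandas as pd

# KL divergence values taken from the two printouts in this notebook
kl_before = pd.DataFrame({"Feature": ["ALBUMIN", "LACTATE"],
                          "KL Divergence": [0.025676, 0.396055]})
kl_after = pd.DataFrame({"Feature": ["ALBUMIN", "LACTATE"],
                         "KL Divergence": [0.263427, 0.474169]})

comparison = kl_before.merge(kl_after, on="Feature",
                             suffixes=("_before", "_after"))
comparison["shift"] = (comparison["KL Divergence_after"]
                       - comparison["KL Divergence_before"])
print(comparison)
```

ALBUMIN shows by far the larger shift, which is expected: roughly a third of its values were filled with just two class medians.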
num_cols_no_target = ['AGE','HEART_RATE','TEMPERATURE', 'RESP_RATE', 'ALBUMIN', 'BUN', 'BILIRUBIN', 'LACTATE', 'BICARBONATE', 'CHLORIDE', 'CREATININE', 'GLUCOSE', 'HEMOGLOBIN', 'HEMATOCRIT', 'PLATELET_COUNT', 'POTASSIUM', 'PTT', 'SODIUM', 'WBC']
# sns.pairplot creates its own figure, so a preceding plt.figure() call
# would only produce an empty figure
sns.pairplot(data[num_cols],
             diag_kind='hist', corner=True, diag_kws={'color': 'red'})
plt.show()
plt.figure(figsize=(30, 30))
corr = data.corr(numeric_only=True)  # correlation matrix over numeric columns
mask = np.triu(np.ones_like(corr, dtype=bool))  # mask the upper triangle
np.fill_diagonal(mask, False)  # keep the diagonal visible
# heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm', mask=mask)
plt.yticks(rotation=0)
plt.show()
df_with_const = add_constant(data[num_cols_no_target])
vif = pd.DataFrame()
vif["feature"] = df_with_const.columns
vif["VIF"] = [variance_inflation_factor(df_with_const.values, i) for i in range(df_with_const.shape[1])]
vif
| feature | VIF | |
|---|---|---|
| 0 | const | 14685.458965 |
| 1 | AGE | 1.207572 |
| 2 | HEART_RATE | 1.166075 |
| 3 | TEMPERATURE | 1.096312 |
| 4 | RESP_RATE | 1.106975 |
| 5 | ALBUMIN | 1.208824 |
| 6 | BUN | 2.131926 |
| 7 | BILIRUBIN | 1.209277 |
| 8 | LACTATE | 1.294945 |
| 9 | BICARBONATE | 2.748871 |
| 10 | CHLORIDE | 5.180620 |
| 11 | CREATININE | 1.900004 |
| 12 | GLUCOSE | 1.098345 |
| 13 | HEMOGLOBIN | 16.071140 |
| 14 | HEMATOCRIT | 15.795112 |
| 15 | PLATELET_COUNT | 1.239323 |
| 16 | POTASSIUM | 1.215595 |
| 17 | PTT | 1.082323 |
| 18 | SODIUM | 4.122088 |
| 19 | WBC | 1.110443 |
# Calculate the mean and covariance matrix of the numerical columns
# (chi2 is already imported at the top of the notebook)
mean = np.mean(data[num_cols], axis=0)
cov = np.cov(data[num_cols], rowvar=False)
inv_covmat = np.linalg.inv(cov)
# Calculate the Mahalanobis distance for each observation
def mahalanobis_distance(row, mean, inv_covmat):
    diff = row - mean
    return np.sqrt(np.dot(np.dot(diff, inv_covmat), diff.T))
data['mahalanobis'] = data[num_cols].apply(lambda row: mahalanobis_distance(row, mean, inv_covmat), axis=1)
# Determine the threshold for the 99% confidence interval
threshold = chi2.ppf(0.99, len(num_cols))
# Identify outliers (note: using np.sqrt(threshold) for the distance)
outliers = data[data['mahalanobis'] > np.sqrt(threshold)]
# Print the outliers DataFrame
print("Outliers:")
print(outliers)
Outliers:
AGE GENDER ETHNICITY HEART_RATE TEMPERATURE \
11 55.0 F UNKNOWN 93.750000 97.522917
29 89.0 F WHITE 80.166667 98.250000
36 44.0 F UNKNOWN 87.750000 97.887500
38 60.0 M WHITE 85.357143 96.562500
66 53.0 F WHITE 74.500000 98.112500
... ... ... ... ... ...
22515 83.0 F BLACK/AFRICAN AMERICAN 45.375000 95.912500
22533 65.0 F UNKNOWN 126.609375 96.287500
22545 60.0 M WHITE 86.016667 99.800000
22555 66.0 F UNKNOWN 106.750000 98.100000
22564 76.0 M WHITE 108.125000 99.387500
RESP_RATE ALBUMIN BUN BILIRUBIN LACTATE BICARBONATE \
11 23.750000 3.666667 58.375000 35.600000 2.700000 20.125000
29 17.437500 3.350000 84.125000 0.500000 2.800000 30.875000
36 16.125000 2.242857 11.187500 17.175000 3.256250 16.250000
38 27.000000 2.200000 33.000000 14.212500 10.187500 10.812500
66 9.000000 4.760000 84.937500 35.275000 4.033333 13.750000
... ... ... ... ... ... ...
22515 24.312500 1.700000 22.416667 1.412500 3.081250 16.270833
22533 12.937500 2.350000 32.714286 1.900000 7.162500 11.187500
22545 29.741667 3.000000 44.125000 1.128571 4.012500 16.750000
22555 12.000000 3.400000 49.500000 4.637500 1.210417 20.750000
22564 17.125000 1.850000 74.875000 0.866667 3.587500 13.875000
CHLORIDE CREATININE GLUCOSE HEMOGLOBIN HEMATOCRIT \
11 107.375000 2.150000 176.125000 7.906250 23.691667
29 90.625000 3.125000 259.500000 9.912500 32.575000
36 107.875000 0.725000 167.062500 8.495833 25.897917
38 98.125000 1.975000 174.062500 8.518750 24.018750
66 87.250000 5.793750 118.937500 7.875000 23.462500
... ... ... ... ... ...
22515 112.833333 1.283333 121.645833 11.300000 36.444444
22533 98.375000 2.642857 211.875000 7.707143 25.578571
22545 104.500000 3.737500 164.000000 9.100000 27.206250
22555 91.625000 3.750000 115.750000 9.200000 26.381250
22564 102.375000 2.525000 230.125000 9.475000 29.375000
PLATELET_COUNT POTASSIUM PTT SODIUM WBC \
11 81.687500 3.875000 50.32500 142.500000 20.068750
29 150.625000 3.650000 54.36250 136.500000 9.675000
36 93.354167 3.743750 51.96250 139.937500 38.395833
38 43.875000 4.550000 94.10625 131.937500 6.243750
66 78.250000 3.856250 60.45000 127.375000 9.687500
... ... ... ... ... ...
22515 88.555556 4.400000 57.80000 136.041667 12.194444
22533 154.214286 4.143750 74.65000 138.000000 19.471429
22545 70.083333 5.428571 78.68125 138.428571 16.812500
22555 469.125000 4.237500 64.25000 133.625000 33.187500
22564 83.000000 3.925000 35.55625 129.000000 25.975000
HOSPITAL_EXPIRE_FLAG mahalanobis
11 1 9.852989
29 1 6.170727
36 1 6.919341
38 1 9.486908
66 1 11.231063
... ... ...
22515 1 7.057887
22533 1 8.731445
22545 1 6.201820
22555 1 6.876449
22564 1 6.807305
[1993 rows x 23 columns]
The Mahalanobis distance identified 1993 observations as multivariate outliers.
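For context, the 99% chi-square cutoff would flag about 1% of observations if the features were jointly Gaussian; the observed rate is much higher, which points to heavy-tailed, non-normal features rather than nearly 2000 genuinely anomalous patients. Simple arithmetic on the counts above:

```python
# Expected vs. observed fraction flagged at the 99% chi-square cutoff
n_total, n_flagged = 22565, 1993
expected_rate = 0.01                  # by construction of chi2.ppf(0.99, df)
observed_rate = n_flagged / n_total
print(f"expected ~{expected_rate:.1%}, observed {observed_rate:.1%}")
```

This gap is worth keeping in mind before deciding to remove or cap the flagged rows.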
data.to_csv('cleaned_health.csv', index=False)